Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Gibier, Marcel, Celton, Nolwenn, Duroselle, Raphaël, Serrano, Pierre, Boeffard, Olivier, Bonastre, Jean-François
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
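The submission above describes its pipeline only at a high level. As a minimal illustrative sketch (not the authors' code), the following Python shows one plausible way to pool calibrated segment-level probabilities into event-level predictions and assemble the structured prompt; the max-pooling rule, the 0.5 threshold, and the prompt wording are all assumptions.

```python
# Hedged sketch of the "segment probabilities -> events -> prompt" step.
# Names (THRESHOLD, build_prompt) are illustrative, not the authors' code.
from typing import Dict, List

THRESHOLD = 0.5  # assumed decision threshold applied after calibration

def segments_to_events(segment_probs: List[Dict[str, float]],
                       threshold: float = THRESHOLD) -> List[str]:
    """Aggregate per-segment probabilities into clip-level event labels
    by max-pooling over time, then thresholding."""
    pooled: Dict[str, float] = {}
    for seg in segment_probs:
        for label, p in seg.items():
            pooled[label] = max(pooled.get(label, 0.0), p)
    return sorted(label for label, p in pooled.items() if p >= threshold)

def build_prompt(events: List[str], question: str, choices: List[str]) -> str:
    """Assemble a structured prompt for the instruction-tuned LLM."""
    lines = ["Detected acoustic events: " + (", ".join(events) or "none"),
             f"Question: {question}",
             "Choices:"]
    lines += [f"  ({chr(ord('a') + i)}) {c}" for i, c in enumerate(choices)]
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)

segs = [{"Dog": 0.9, "Speech": 0.2}, {"Dog": 0.7, "Rain": 0.6}]
print(build_prompt(segments_to_events(segs),
                   "What animal can be heard?", ["cat", "dog", "bird"]))
```

Max-pooling over segments is one common aggregation choice for clip-level event detection; in the described system the calibrated probabilities would come from the BEATs-based classification head.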
Improving Bird Classification with Primary Color Additives
R, Ezhini Rasendiran, Maurya, Chandresh Kumar
We address the problem of classifying bird species using their song recordings, a challenging task due to environmental noise, overlapping vocalizations, and missing labels. Existing models struggle with low-SNR or multi-species recordings. We hypothesize that birds can be classified by visualizing their pitch pattern, speed, and repetition, collectively called motifs. Deep learning models applied to spectrogram images help, but similar motifs across species cause confusion. To mitigate this, we embed frequency information into spectrograms using primary color additives. This enhances species distinction and improves classification accuracy. Our experiments show that the proposed approach achieves statistically significant gains over models without colorization and surpasses the BirdCLEF 2024 winner, improving F1 by 7.3%, ROC-AUC by 6.2%, and CMAP by 6.6%. These results demonstrate the effectiveness of incorporating frequency information via colorization.
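As a rough sketch of the colorization idea (band boundaries and channel assignment are assumptions, not the paper's exact recipe), one can map low, mid, and high frequency bands of a spectrogram onto the red, green, and blue channels:

```python
# Hedged sketch: encode frequency information into a spectrogram image via
# primary color additives. The three equal bands are an invented choice.
import numpy as np

def colorize_spectrogram(spec: np.ndarray) -> np.ndarray:
    """Map a (freq_bins, frames) magnitude spectrogram to an RGB image whose
    channels emphasize low, mid, and high frequency bands respectively."""
    spec = spec / (spec.max() + 1e-8)                   # normalize to [0, 1]
    n = spec.shape[0]
    lo, hi = n // 3, 2 * n // 3
    rgb = np.zeros((n, spec.shape[1], 3))
    rgb[:lo, :, 0] = spec[:lo]                          # low band  -> red
    rgb[lo:hi, :, 1] = spec[lo:hi]                      # mid band  -> green
    rgb[hi:, :, 2] = spec[hi:]                          # high band -> blue
    return rgb

# Toy spectrogram with one low- and one high-frequency motif.
toy = np.zeros((128, 64))
toy[10, :] = 1.0
toy[110, :] = 0.5
image = colorize_spectrogram(toy)
print(image.shape)  # (128, 64, 3), ready for a CNN image classifier
```

The point of the encoding is that two motifs with similar shapes but different frequency ranges now differ in color, which a standard image classifier can exploit.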
Benchmarking Representations for Speech, Music, and Acoustic Events
La Quatra, Moreno, Koudounas, Alkis, Vaiani, Lorenzo, Baralis, Elena, Cagliero, Luca, Garza, Paolo, Siniscalchi, Sabato Marco
Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that the presented wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods, and is useful to pinpoint promising research directions.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
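A hedged sketch of the frozen-encoder evaluation protocol that benchmarks like ARCH rely on: extract clip embeddings from a pre-trained SSL model, then fit a lightweight classifier on top. Random vectors stand in for real embeddings here, so the score is only a smoke test of the protocol.

```python
# Hedged sketch of linear-probe evaluation over frozen SSL features.
# The random "embeddings" and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # stand-in for frozen encoder output
labels = rng.integers(0, 10, size=500)     # stand-in for dataset labels

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear-probe accuracy: {probe.score(X_te, y_te):.3f}")
```

Because the encoder stays frozen and only the probe is trained per dataset, adding a new dataset or model to such a benchmark reduces to swapping the embedding extraction step.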
Double Mixture: Towards Continual Event Detection from Speech
Kang, Jingqi, Wu, Tongtong, Zhao, Jinming, Wang, Guitao, Wei, Yinwei, Yang, Hao, Qi, Guilin, Li, Yuan-Fang, Haffari, Gholamreza
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing.
Figure 1: In continual learning, learners incrementally acquire new event types and must evaluate all previously learned types during testing. This process is particularly challenging in speech-based scenarios due to the complex interplay of semantic content (semantic event) and background sounds (acoustic event).
- Oceania > Australia > Victoria > Melbourne (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
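For orientation, here is a toy rehearsal baseline for the continual setting described above. It illustrates only the task setup and the replay memory; it is not the 'Double Mixture' method, and the blob features, memory size, and task splits are invented.

```python
# Hedged sketch: continual event detection with a small replay memory.
# Each task introduces two new classes; stored examples are rehearsed.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier(loss="log_loss", random_state=0)
all_classes = np.arange(6)
memory_X, memory_y = [], []

for task_id, classes in enumerate([(0, 1), (2, 3), (4, 5)]):
    # Toy data: each class is a Gaussian blob in 8-d feature space.
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 8)) for c in classes])
    y = np.repeat(classes, 50)
    if memory_X:  # rehearse a few stored examples from earlier tasks
        X = np.vstack([X] + memory_X)
        y = np.concatenate([y] + memory_y)
    model.partial_fit(X, y, classes=all_classes if task_id == 0 else None)
    memory_X.append(X[:10]); memory_y.append(y[:10])
    print(f"task {task_id}: trained on classes {sorted(set(y.tolist()))}")
```

Without the replay branch, a model trained this way tends to lose the early classes entirely, which is the catastrophic-forgetting failure mode the paper targets.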
A Proposal for Foley Sound Synthesis Challenge
Choi, Keunwoo, Oh, Sangshin, Kang, Minsung, McFee, Brian
"Foley" refers to sound effects added to multimedia during post-production to enhance its perceived acoustic properties, e.g., by simulating the sounds of footsteps, ambient environmental sounds, or visible objects on the screen. While foley is traditionally produced by foley artists, there is increasing interest in automatic or machine-assisted techniques building upon recent advances in sound synthesis and generative models. To foster more participation in this growing research area, we propose a challenge for automatic foley synthesis. Through case studies on successful previous challenges in audio and machine learning, we set the goals of the proposed challenge: rigorous, unified, and efficient evaluation of different foley synthesis systems, with an overarching goal of drawing active participation from the research community. We outline the details and design considerations of a foley sound synthesis challenge, including task definition, dataset requirements, and evaluation criteria.
- North America > United States > New York (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Research Report (0.50)
- Overview (0.48)
- Leisure & Entertainment (0.69)
- Media > Music (0.54)
Hierarchical spike coding of sound
Karklin, Yan, Ekanadham, Chaitanya, Simoncelli, Eero P.
Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.
- North America > United States > New York (0.05)
- Oceania > Australia > Western Australia > North West Shelf (0.04)
- North America > United States > Nevada (0.04)
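The first-layer encoding can be pictured as greedy sparse coding over time-frequency kernels. Below is a minimal matching-pursuit sketch under that reading; the kernel shapes and the fixed spike budget are assumptions, and the paper's actual model is probabilistic with a second spiking layer on top.

```python
# Hedged sketch: encode a waveform as sparse "spikes" (kernel, time, amplitude)
# via matching pursuit over windowed sinusoid kernels.
import numpy as np

def make_kernels(freqs, sr=8000, length=128):
    t = np.arange(length) / sr
    window = np.hanning(length)
    kernels = [window * np.sin(2 * np.pi * f * t) for f in freqs]
    return [k / np.linalg.norm(k) for k in kernels]  # unit-norm kernels

def matching_pursuit(signal, kernels, n_spikes=10):
    residual = signal.astype(float).copy()
    spikes = []
    for _ in range(n_spikes):
        # Cross-correlate residual with every kernel; pick the best match.
        scores = [np.correlate(residual, k, mode="valid") for k in kernels]
        ki = int(np.argmax([np.abs(s).max() for s in scores]))
        ti = int(np.abs(scores[ki]).argmax())
        amp = scores[ki][ti]
        residual[ti:ti + len(kernels[ki])] -= amp * kernels[ki]
        spikes.append((ki, ti, amp))
    return spikes, residual

sr = 8000
t = np.arange(2048) / sr
sig = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
spikes, res = matching_pursuit(sig, make_kernels([440, 880], sr), n_spikes=20)
print(f"residual energy fraction: {np.sum(res**2) / np.sum(sig**2):.3f}")
```

Each spike records which kernel fired, where in time, and how strongly; the second layer of the paper's model then captures statistical regularities in these spike positions.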
Affine Structure From Sound
We consider the problem of localizing a set of microphones together with a set of external acoustic events (e.g., hand claps), emitted at unknown times and unknown locations. We propose a solution that approximates this problem under a far field approximation defined in the calculus of affine geometry, and that relies on singular value decomposition (SVD) to recover the affine structure of the problem. We then define low-dimensional optimization techniques for embedding the solution into Euclidean geometry, and further techniques for recovering the locations and emission times of the acoustic events. The approach is useful for the calibration of ad-hoc microphone arrays and sensor networks.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
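The far-field rank argument can be checked numerically: with unknown emission times T_j, the arrival times are t_ij = T_j + |m_i - s_j|/c ≈ (T_j + |s_j|/c) - (m_i · u_j)/c, so column-centering the time matrix leaves an approximately rank-3 factor from which SVD recovers the affine structure. The following synthetic sketch (geometry and noise-free times invented) verifies the rank drop.

```python
# Hedged sketch: far-field rank structure of the time-of-arrival matrix.
import numpy as np

rng = np.random.default_rng(2)
c = 343.0                                    # speed of sound (m/s)
mics = rng.uniform(-1.0, 1.0, size=(8, 3))   # microphone positions (m)
dirs = rng.normal(size=(12, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
sources = 100.0 * dirs                       # events ~100 m away (far field)
onsets = rng.uniform(0.0, 1.0, size=12)      # unknown emission times

dist = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=2)
toa = onsets[None, :] + dist / c             # observed (mic, event) arrival times

# Column-centering removes emission times and source ranges; what remains is
# approximately -(m_i - mean_m) . u_j / c, an affine (rank <= 3) structure.
centered = toa - toa.mean(axis=0, keepdims=True)
sv = np.linalg.svd(centered, compute_uv=False)
print("normalized singular values:", np.round(sv / sv[0], 4))
```

The first three normalized singular values dominate while the rest fall near the far-field approximation error, which is the rank-3 structure the paper's SVD step exploits before upgrading the solution to Euclidean geometry.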